Biostat 212a Homework 4

Due Mar. 5, 2024 @ 11:59PM

Author

Yang An and UID: 106332601

Published

March 2, 2024

1 ISL Exercise 8.4.3 (10pts)

  1. Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of p̂m1. The x-axis should display p̂m1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. Hint: In a setting with two classes, p̂m1 = 1 − p̂m2. You could make this plot by hand, but it will be much easier to make in R.
# Class proportions: with two classes, p2 = 1 - p1
p1 <- seq(0, 1, 0.01)
p2 <- 1 - p1
# Gini index: 2 * p1 * p2
gini <- 2 * p1 * p2
# Classification error: 1 - max(p1, p2)
class.error <- 1 - pmax(p1, p2)
# Entropy: -p1 * log2(p1) - p2 * log2(p2) (NaN at the endpoints, where it is 0)
entropy <- -pmax(p1, p2) * log2(pmax(p1, p2)) - pmin(p1, p2) * log2(pmin(p1, p2))
matplot(p1, cbind(gini, class.error, entropy), xlab = "p1",
        ylab = "Gini index, Classification error, Entropy",
        type = "l", col = c("green", "blue", "orange"))
legend("topright", legend = c("Gini index", "Class.error", "Entropy"),
       pch = 20, col = c("green", "blue", "orange"))
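As a quick sanity check in base R, all three measures attain their maximum at p̂m1 = 0.5, where the Gini index and classification error both equal 0.5 and the entropy equals 1 (the variable names below are illustrative):

```r
# Evaluate the three impurity measures at p = 0.5, the common maximizer
p <- 0.5
gini_half <- 2 * p * (1 - p)                          # Gini index
error_half <- 1 - max(p, 1 - p)                       # classification error
entropy_half <- -p * log2(p) - (1 - p) * log2(1 - p)  # entropy
c(gini = gini_half, error = error_half, entropy = entropy_half)
```

This also shows why the entropy curve sits above the other two in the plot: at its peak it reaches 1 while the Gini index and classification error peak at 0.5.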

2 ISL Exercise 8.4.4 (10pts)

FIGURE 8.14. Left: A partition of the predictor space corresponding to Exercise 4a. Right: A tree corresponding to Exercise 4b.

This question relates to the plots in Figure 8.14.

(a) Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of Figure 8.14. The numbers inside the boxes indicate the mean of Y within each region.

The tree reads: if X1 ≥ 1, predict 5; else if X2 ≥ 1, predict 15; else if X1 < 0, predict 3; else if X2 < 0, predict 10; else predict 0.

library(knitr)
include_graphics("/Users/yangan/Desktop/212A/212a-hw/hw4/8.4.4(a).jpg")
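The part (a) decision rule can equivalently be written as a small prediction function. A minimal base-R sketch (the function name is ours; the region means 5, 15, 3, 10, 0 are read off the left-hand panel of Figure 8.14):

```r
# Predicted mean of Y for part (a), following the splits top to bottom
predict_tree_a <- function(x1, x2) {
  if (x1 >= 1) return(5)
  if (x2 >= 1) return(15)
  if (x1 < 0)  return(3)
  if (x2 < 0)  return(10)
  0
}
predict_tree_a(2, 0)    # X1 >= 1 region: 5
predict_tree_a(0.5, 2)  # X2 >= 1 region: 15
```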

(b) Create a diagram similar to the left-hand panel of Figure 8.14, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.
# (b)
par(xpd = NA)
plot(NA, NA, type = "n", xlim = c(-2, 2), ylim = c(-3, 3), xlab = "X1", ylab = "X2")
# X2 < 1
lines(x = c(-2, 2), y = c(1, 1))
# X1 < 1 with X2 < 1
lines(x = c(1, 1), y = c(-3, 1))
text(x = (-2 + 1)/2, y = -1, labels = c(-1.8))
text(x = 1.5, y = -1, labels = c(0.63))
# X2 < 2 with X2 >= 1
lines(x = c(-2, 2), y = c(2, 2))
text(x = 0, y = 2.5, labels = c(2.49))
# X1 < 0 with X2<2 and X2>=1
lines(x = c(0, 0), y = c(1, 2))
text(x = -1, y = 1.5, labels = c(-1.06))
text(x = 1, y = 1.5, labels = c(0.21))
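The partition drawn above can be cross-checked against the right-hand tree with a small prediction function (the function name is ours; the region means are taken from Figure 8.14):

```r
# Mean of Y in each region of the part (b) partition
predict_tree_b <- function(x1, x2) {
  if (x2 < 1) {
    if (x1 < 1) -1.80 else 0.63
  } else if (x2 < 2) {
    if (x1 < 0) -1.06 else 0.21
  } else {
    2.49
  }
}
predict_tree_b(-1, 0)  # X2 < 1, X1 < 1 region: -1.80
predict_tree_b(1, 3)   # X2 >= 2 region: 2.49
```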

3 ISL Exercise 8.4.5 (10pts)

  1. Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10 estimates of P(Class is Red | X): 0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75. There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?
p <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
# Average probability
mean(p)
[1] 0.45

With the majority-vote approach, we classify X as Red, since Red is the class predicted by 6 of the 10 trees (the six probabilities above 0.5) versus 4 for Green. With the average-probability approach, we classify X as Green, since the average of the 10 probabilities is 0.45 < 0.5.
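Both rules can be computed directly in base R (using 0.5 as the Red/Green cutoff):

```r
p <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)

# Majority vote: count trees whose probability of Red exceeds 0.5
votes_red <- sum(p > 0.5)
majority  <- ifelse(votes_red > length(p) / 2, "Red", "Green")

# Average probability: classify by the mean of the 10 probabilities
avg_prob <- mean(p)
average  <- ifelse(avg_prob > 0.5, "Red", "Green")

c(majority = majority, average = average)  # "Red" and "Green"
```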

4 ISL Lab 8.3. Boston data set (30pts)

Follow the machine learning workflow to train regression tree, random forest, and boosting methods for predicting medv. Evaluate out-of-sample performance on a test set.

rm(list = ls())
library(GGally)
library(gtsummary)
library(ranger)
library(tidyverse)
library(tidymodels)
library(ISLR2)
# MASS provides the Boston data used here; note it masks dplyr::select()
library(MASS)
library(rpart)
library(rpart.plot)
library(vip)
library(randomForest)
library(gbm)
library(xgboost)
# Load the Boston data set
data(Boston)
head(Boston)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
  medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7
Boston %>% tbl_summary()
Characteristic N = 506¹
crim 0.3 (0.1, 3.7)
zn 0 (0, 13)
indus 9.7 (5.2, 18.1)
chas 35 (6.9%)
nox 0.54 (0.45, 0.62)
rm 6.21 (5.89, 6.62)
age 78 (45, 94)
dis 3.21 (2.10, 5.19)
rad
    1 20 (4.0%)
    2 24 (4.7%)
    3 38 (7.5%)
    4 110 (22%)
    5 115 (23%)
    6 26 (5.1%)
    7 17 (3.4%)
    8 24 (4.7%)
    24 132 (26%)
tax 330 (279, 666)
ptratio 19.05 (17.40, 20.20)
black 391 (375, 396)
lstat 11 (7, 17)
medv 21 (17, 25)
¹ Median (IQR); n (%)
Boston <- Boston %>% filter(!is.na(medv))
# Split the data into training and test sets
set.seed(203)
data_split <- initial_split(Boston, prop = 0.5)
Bonston_train <- training(data_split)
Bonston_test <- testing(data_split)
tree_recipe <- 
  recipe(
    medv ~ ., 
    data = Bonston_train
  ) %>%
  # # create traditional dummy variables (not necessary for random forest in R)
  # step_dummy(all_nominal()) %>%
  step_naomit(medv) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>% 
  #  center and scale numeric data
  step_log(medv) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  # step_normalize(all_numeric_predictors()) %>%
  # estimate the means and standard deviations
  prep(training = Bonston_train, retain = TRUE)
tree_recipe
tree_recipe_spec <- 
  recipe(
    medv ~ ., 
    data = Bonston_train
  ) %>%
  # # create traditional dummy variables (not necessary for random forest in R)
  # step_dummy(all_nominal()) %>%
  step_naomit(medv) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>%
 #  center and scale numeric data
  step_log(medv) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) 

tree_recipe_spec
# regression tree model
regtree_mod <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = 5,
  mode = "regression",
  engine = "rpart"
)

# random forest model
rf_mod <- rand_forest(
  mode = "regression",
  engine = "randomForest",
  trees = 500,
  mtry = tune()
)

# boosting model
boost_mod <- boost_tree(
  mode = "regression",
  engine = "xgboost",
  trees = 500,
  mtry = tune(),
  learn_rate = tune()
)
# regression tree model
tree_wf <- workflow() %>%
  add_recipe(tree_recipe_spec) %>%
  add_model(regtree_mod)

# random forest model
rf_wf <- workflow() %>%
  add_recipe(tree_recipe_spec) %>%
  add_model(rf_mod)

# boosting model
boost_wf <- workflow() %>%
  add_recipe(tree_recipe_spec) %>%
  add_model(boost_mod)
# grid for regression tree model
tree_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  levels = c(100, 5)
)

# grid for random forest model
rf_grid <- grid_regular(
  mtry(range = c(2, ncol(Bonston_train) - 1)),
  levels = 5
)

# grid for boosting model
# NB: dials::learn_rate() is parameterized on the log10 scale by default, so
# range = c(0.01, 0.1) yields learning rates 10^0.01 to 10^0.1 (about 1.02-1.26)
boost_grid <- grid_regular(
  mtry(range = c(2, ncol(Bonston_train) - 1)),
  learn_rate(range = c(0.01, 0.1)),
  levels = 5
)
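Note that `dials::learn_rate()` works on the log10 scale by default, so `range = c(0.01, 0.1)` requests learning rates between 10^0.01 and 10^0.1, not between 0.01 and 0.1; this is why learn_rate values near 1 appear in the CV and tuning output below. A base-R sketch of the five grid values this range actually produces (assuming `grid_regular()` spaces the levels evenly on the transformed scale):

```r
# learn_rate(range = c(0.01, 0.1)) is interpreted on the log10 scale,
# so the five regular grid points are 10^seq(0.01, 0.1, length.out = 5)
rates <- 10^seq(0.01, 0.1, length.out = 5)
round(rates, 3)  # 1.023 1.078 1.135 1.195 1.259
```

To tune over learning rates of 0.01 to 0.1 on the original scale, the range would instead be given as `c(-2, -1)` on the log10 scale.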
# Set cross-validation partitions
set.seed(203)

folds <- vfold_cv(Bonston_train, v = 5)
folds
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits           id   
  <list>           <chr>
1 <split [202/51]> Fold1
2 <split [202/51]> Fold2
3 <split [202/51]> Fold3
4 <split [203/50]> Fold4
5 <split [203/50]> Fold5
# cv for regression tree model
tree_fit <- tree_wf %>%
  tune_grid(
    resamples = folds,
    grid = tree_grid,
    metrics = metric_set(rmse, rsq)
  )
tree_fit
# Tuning results
# 5-fold cross-validation 
# A tibble: 5 × 4
  splits           id    .metrics             .notes          
  <list>           <chr> <list>               <list>          
1 <split [202/51]> Fold1 <tibble [1,000 × 6]> <tibble [0 × 3]>
2 <split [202/51]> Fold2 <tibble [1,000 × 6]> <tibble [0 × 3]>
3 <split [202/51]> Fold3 <tibble [1,000 × 6]> <tibble [0 × 3]>
4 <split [203/50]> Fold4 <tibble [1,000 × 6]> <tibble [0 × 3]>
5 <split [203/50]> Fold5 <tibble [1,000 × 6]> <tibble [0 × 3]>
# cv for random forest model
rf_fit <- rf_wf %>%
  tune_grid(
    resamples = folds,
    grid = rf_grid,
    metrics = metric_set(rmse, rsq)
  )
rf_fit
# Tuning results
# 5-fold cross-validation 
# A tibble: 5 × 4
  splits           id    .metrics          .notes          
  <list>           <chr> <list>            <list>          
1 <split [202/51]> Fold1 <tibble [10 × 5]> <tibble [0 × 3]>
2 <split [202/51]> Fold2 <tibble [10 × 5]> <tibble [0 × 3]>
3 <split [202/51]> Fold3 <tibble [10 × 5]> <tibble [0 × 3]>
4 <split [203/50]> Fold4 <tibble [10 × 5]> <tibble [0 × 3]>
5 <split [203/50]> Fold5 <tibble [10 × 5]> <tibble [0 × 3]>
# cv for boosting model
boost_fit <- boost_wf %>%
  tune_grid(
    resamples = folds,
    grid = boost_grid,
    metrics = metric_set(rmse, rsq)
  )
boost_fit
# Tuning results
# 5-fold cross-validation 
# A tibble: 5 × 4
  splits           id    .metrics          .notes          
  <list>           <chr> <list>            <list>          
1 <split [202/51]> Fold1 <tibble [50 × 6]> <tibble [0 × 3]>
2 <split [202/51]> Fold2 <tibble [50 × 6]> <tibble [0 × 3]>
3 <split [202/51]> Fold3 <tibble [50 × 6]> <tibble [0 × 3]>
4 <split [203/50]> Fold4 <tibble [50 × 6]> <tibble [0 × 3]>
5 <split [203/50]> Fold5 <tibble [50 × 6]> <tibble [0 × 3]>
# Visualize CV results
tree_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "rmse") %>%
  mutate(tree_depth = as.factor(tree_depth)) %>%
  ggplot(mapping = aes(x = cost_complexity, y = mean, color = tree_depth)) +
  geom_point() + 
  geom_line() + 
  labs(x = "cost_complexity", y = "CV RMSE")
# A tibble: 1,000 × 8
   cost_complexity tree_depth .metric .estimator  mean     n std_err
             <dbl>      <int> <chr>   <chr>      <dbl> <int>   <dbl>
 1        1   e-10          1 rmse    standard   0.736     5  0.0548
 2        1   e-10          1 rsq     standard   0.459     5  0.0304
 3        1.23e-10          1 rmse    standard   0.736     5  0.0548
 4        1.23e-10          1 rsq     standard   0.459     5  0.0304
 5        1.52e-10          1 rmse    standard   0.736     5  0.0548
 6        1.52e-10          1 rsq     standard   0.459     5  0.0304
 7        1.87e-10          1 rmse    standard   0.736     5  0.0548
 8        1.87e-10          1 rsq     standard   0.459     5  0.0304
 9        2.31e-10          1 rmse    standard   0.736     5  0.0548
10        2.31e-10          1 rsq     standard   0.459     5  0.0304
   .config               
   <chr>                 
 1 Preprocessor1_Model001
 2 Preprocessor1_Model001
 3 Preprocessor1_Model002
 4 Preprocessor1_Model002
 5 Preprocessor1_Model003
 6 Preprocessor1_Model003
 7 Preprocessor1_Model004
 8 Preprocessor1_Model004
 9 Preprocessor1_Model005
10 Preprocessor1_Model005
# ℹ 990 more rows

rf_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "rmse") %>%
  ggplot(mapping = aes(x = mtry, y = mean)) +
  geom_point() + 
  geom_line() + 
  labs(x = "mtry", y = "CV RMSE")
# A tibble: 10 × 7
    mtry .metric .estimator  mean     n std_err .config             
   <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
 1     2 rmse    standard   0.457     5  0.0399 Preprocessor1_Model1
 2     2 rsq     standard   0.786     5  0.0697 Preprocessor1_Model1
 3     4 rmse    standard   0.416     5  0.0440 Preprocessor1_Model2
 4     4 rsq     standard   0.808     5  0.0665 Preprocessor1_Model2
 5     7 rmse    standard   0.402     5  0.0444 Preprocessor1_Model3
 6     7 rsq     standard   0.816     5  0.0595 Preprocessor1_Model3
 7    10 rmse    standard   0.402     5  0.0453 Preprocessor1_Model4
 8    10 rsq     standard   0.814     5  0.0575 Preprocessor1_Model4
 9    13 rmse    standard   0.407     5  0.0442 Preprocessor1_Model5
10    13 rsq     standard   0.811     5  0.0547 Preprocessor1_Model5

boost_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "rmse") %>%
  ggplot(mapping = aes(x = mtry, y = mean, group = learn_rate, color = learn_rate)) +
  geom_point() + 
  geom_line() + 
  labs(x = "mtry", y = "CV RMSE")
# A tibble: 50 × 8
    mtry learn_rate .metric .estimator  mean     n std_err .config              
   <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1     2       1.02 rmse    standard   0.557     5  0.0458 Preprocessor1_Model01
 2     2       1.02 rsq     standard   0.674     5  0.0875 Preprocessor1_Model01
 3     4       1.02 rmse    standard   0.515     5  0.0295 Preprocessor1_Model02
 4     4       1.02 rsq     standard   0.738     5  0.0428 Preprocessor1_Model02
 5     7       1.02 rmse    standard   0.512     5  0.0411 Preprocessor1_Model03
 6     7       1.02 rsq     standard   0.723     5  0.0541 Preprocessor1_Model03
 7    10       1.02 rmse    standard   0.484     5  0.0234 Preprocessor1_Model04
 8    10       1.02 rsq     standard   0.751     5  0.0449 Preprocessor1_Model04
 9    13       1.02 rmse    standard   0.507     5  0.0384 Preprocessor1_Model05
10    13       1.02 rsq     standard   0.739     5  0.0515 Preprocessor1_Model05
# ℹ 40 more rows

tree_fit %>%
  show_best("rmse")
# A tibble: 5 × 8
  cost_complexity tree_depth .metric .estimator  mean     n std_err .config     
            <dbl>      <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>       
1         0.00351          8 rmse    standard   0.488     5  0.0537 Preprocesso…
2         0.00351         11 rmse    standard   0.488     5  0.0537 Preprocesso…
3         0.00351         15 rmse    standard   0.488     5  0.0537 Preprocesso…
4         0.00285          8 rmse    standard   0.490     5  0.0545 Preprocesso…
5         0.00285         11 rmse    standard   0.490     5  0.0545 Preprocesso…
rf_fit %>%
  show_best("rmse")
# A tibble: 5 × 7
   mtry .metric .estimator  mean     n std_err .config             
  <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1    10 rmse    standard   0.402     5  0.0453 Preprocessor1_Model4
2     7 rmse    standard   0.402     5  0.0444 Preprocessor1_Model3
3    13 rmse    standard   0.407     5  0.0442 Preprocessor1_Model5
4     4 rmse    standard   0.416     5  0.0440 Preprocessor1_Model2
5     2 rmse    standard   0.457     5  0.0399 Preprocessor1_Model1
boost_fit %>%
  show_best("rmse")
# A tibble: 5 × 8
   mtry learn_rate .metric .estimator  mean     n std_err .config              
  <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1    10       1.02 rmse    standard   0.484     5  0.0234 Preprocessor1_Model04
2    13       1.14 rmse    standard   0.491     5  0.0447 Preprocessor1_Model15
3    10       1.08 rmse    standard   0.491     5  0.0443 Preprocessor1_Model09
4     7       1.08 rmse    standard   0.504     5  0.0136 Preprocessor1_Model08
5    13       1.02 rmse    standard   0.507     5  0.0384 Preprocessor1_Model05

Let’s select the best model.

best_tree <- tree_fit %>%
  select_best("rmse")
best_tree
# A tibble: 1 × 3
  cost_complexity tree_depth .config               
            <dbl>      <int> <chr>                 
1         0.00351          8 Preprocessor1_Model284
best_rf <- rf_fit %>%
  select_best("rmse")
best_rf
# A tibble: 1 × 2
   mtry .config             
  <int> <chr>               
1    10 Preprocessor1_Model4
best_boost <- boost_fit %>%
  select_best("rmse")
best_boost
# A tibble: 1 × 3
   mtry learn_rate .config              
  <int>      <dbl> <chr>                
1    10       1.02 Preprocessor1_Model04
# Final workflow
final_wftree <- tree_wf %>%
  finalize_workflow(best_tree)
final_wftree
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_naomit()
• step_zv()
• step_log()
• step_center()
• step_scale()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (regression)

Main Arguments:
  cost_complexity = 0.00351119173421513
  tree_depth = 8
  min_n = 5

Computational engine: rpart 
final_wfrf <- rf_wf %>%
  finalize_workflow(best_rf)
final_wfrf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_naomit()
• step_zv()
• step_log()
• step_center()
• step_scale()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 10
  trees = 500

Computational engine: randomForest 
final_wfboost <- boost_wf %>%
  finalize_workflow(best_boost)
final_wfboost
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_naomit()
• step_zv()
• step_log()
• step_center()
• step_scale()

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (regression)

Main Arguments:
  mtry = 10
  trees = 500
  learn_rate = 1.02329299228075

Computational engine: xgboost 
# Fit the whole training set, then predict the test cases
final_fittree <- 
  final_wftree %>%
  last_fit(data_split)
final_fittree
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits            id               .metrics .notes   .predictions .workflow 
  <list>            <chr>            <list>   <list>   <list>       <list>    
1 <split [253/253]> train/test split <tibble> <tibble> <tibble>     <workflow>
final_fitrf <- 
  final_wfrf %>%
  last_fit(data_split)
final_fitrf
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits            id               .metrics .notes   .predictions .workflow 
  <list>            <chr>            <list>   <list>   <list>       <list>    
1 <split [253/253]> train/test split <tibble> <tibble> <tibble>     <workflow>
final_fitboost <- 
  final_wfboost %>%
  last_fit(data_split)
final_fitboost
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits            id               .metrics .notes   .predictions .workflow 
  <list>            <chr>            <list>   <list>   <list>       <list>    
1 <split [253/253]> train/test split <tibble> <tibble> <tibble>     <workflow>
# Test metrics
final_fittree %>% 
  collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       0.575 Preprocessor1_Model1
2 rsq     standard       0.679 Preprocessor1_Model1
final_fitrf %>%
  collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       0.367 Preprocessor1_Model1
2 rsq     standard       0.843 Preprocessor1_Model1
final_fitboost %>%
  collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       0.545 Preprocessor1_Model1
2 rsq     standard       0.729 Preprocessor1_Model1

On the test set, the random forest performs best (RMSE 0.367, R² 0.843), followed by boosting (RMSE 0.545, R² 0.729) and the pruned regression tree (RMSE 0.575, R² 0.679).
# Visualize the final model
library(rpart.plot)
library(randomForest)
library(xgboost)
library(data.table)
final_tree <- extract_workflow(final_fittree)
final_tree
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_naomit()
• step_zv()
• step_log()
• step_center()
• step_scale()

── Model ───────────────────────────────────────────────────────────────────────
n= 253 

node), split, n, deviance, yval
      * denotes terminal node

  1) root 253 252.00000000 -1.721943e-15  
    2) lstat>=0.2389889 90  52.87544000 -9.274618e-01  
      4) crim>=0.4172743 39  19.49364000 -1.457369e+00  
        8) lstat>=1.596959 11   5.76440300 -2.135433e+00  
         16) dis< -1.083781 2   0.14677250 -3.043921e+00 *
         17) dis>=-1.083781 9   3.60010500 -1.933546e+00  
           34) dis>=-1.029693 6   0.37089320 -2.172958e+00 *
           35) dis< -1.029693 3   2.19748600 -1.454722e+00 *
        9) lstat< 1.596959 28   6.68489700 -1.190986e+00  
         18) crim>=0.9254287 11   2.12120200 -1.503941e+00 *
         19) crim< 0.9254287 17   2.78924500 -9.884866e-01  
           38) rm< -0.6474225 4   0.37970260 -1.539559e+00 *
           39) rm>=-0.6474225 13   0.82105820 -8.189259e-01 *
      5) crim< 0.4172743 51  14.05606000 -5.222389e-01  
       10) crim>=-0.4060389 45   9.65170800 -6.259578e-01  
         20) black< 0.2746026 23   5.39535000 -8.422694e-01  
           40) age>=1.071618 2   1.76464300 -1.586705e+00 *
           41) age< 1.071618 21   2.41678100 -7.713708e-01 *
         21) black>=0.2746026 22   2.05507000 -3.998140e-01  
           42) dis< -1.00852 3   0.02951063 -9.425904e-01 *
           43) dis>=-1.00852 19   1.00219000 -3.141124e-01 *
       11) crim< -0.4060389 6   0.28956410  2.556533e-01 *
    3) lstat< 0.2389889 163  78.96246000  5.120955e-01  
      6) rm< 0.5276014 113  21.39101000  1.634739e-01  
       12) dis>=-1.186972 111  13.88784000  1.288850e-01  
         24) rm< -0.2633651 52   5.24499700 -5.418782e-02  
           48) ptratio>=1.016231 7   1.29369100 -4.147737e-01 *
           49) ptratio< 1.016231 45   2.89957100  1.903327e-03 *
         25) rm>=-0.2633651 59   5.36399100  2.902374e-01  
           50) tax>=-1.084652 55   3.39454500  2.437258e-01  
            100) lstat>=-0.4004746 17   0.75930570 -1.817869e-04 *
            101) lstat< -0.4004746 38   1.17144900  3.528423e-01 *
           51) tax< -1.084652 4   0.21444600  9.297718e-01 *
       13) dis< -1.186972 2   0.00000000  2.083157e+00 *
      7) rm>=0.5276014 50  12.79965000  1.299980e+00  
       14) rm< 1.659133 34   3.66122100  1.018091e+00  
         28) dis>=-0.8659154 32   2.35915800  9.745802e-01  
           56) rm< 0.7244216 8   0.58169800  6.630926e-01 *
           57) rm>=0.7244216 24   0.74253100  1.078409e+00 *
         29) dis< -0.8659154 2   0.27215970  1.714267e+00 *
       15) rm>=1.659133 16   0.69565820  1.898994e+00 *
final_tree %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

final_rf <- extract_workflow(final_fitrf)
final_rf
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_naomit()
• step_zv()
• step_log()
• step_center()
• step_scale()

── Model ───────────────────────────────────────────────────────────────────────

Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~500, mtry = min_cols(~10L,      x)) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 10

          Mean of squared residuals: 0.136993
                    % Var explained: 86.25
final_rf %>%
  extract_fit_engine() %>%
  randomForest::varImpPlot()

final_boost <- extract_workflow(final_fitboost)
final_boost
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_naomit()
• step_zv()
• step_log()
• step_center()
• step_scale()

── Model ───────────────────────────────────────────────────────────────────────
##### xgb.Booster
raw: 388.7 Kb 
call:
  xgboost::xgb.train(params = list(eta = 1.02329299228075, max_depth = 6, 
    gamma = 0, colsample_bytree = 1, colsample_bynode = 0.769230769230769, 
    min_child_weight = 1, subsample = 1), data = x$data, nrounds = 500, 
    watchlist = x$watchlist, verbose = 0, nthread = 1, objective = "reg:squarederror")
params (as set within xgb.train):
  eta = "1.02329299228075", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "0.769230769230769", min_child_weight = "1", subsample = "1", nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
xgb.attributes:
  niter
callbacks:
  cb.evaluation.log()
# of features: 13 
niter: 500
nfeatures : 13 
evaluation_log:
    iter training_rmse
       1  0.3077342515
       2  0.2053184932
---                   
     499  0.0005818133
     500  0.0005818133
# Extract the fitted xgboost engine and plot variable importance
boost_model <- extract_fit_engine(final_boost)
importance <- xgb.importance(model = boost_model)
xgboost::xgb.plot.importance(importance_matrix = importance)

library(vip)

final_tree %>% 
  extract_fit_parsnip() %>% 
  vip()

final_rf %>%
  extract_fit_parsnip() %>%
  vip()

final_boost %>%
  extract_fit_parsnip() %>%
  vip()

5 ISL Lab 8.3 Carseats data set (30pts)

Follow the machine learning workflow to train classification tree, random forest, and boosting methods for classifying Sales <= 8 versus Sales > 8. Evaluate out-of-sample performance on a test set.

rm(list = ls())
# Load necessary libraries
library(GGally)
library(gtsummary)
library(ranger)
library(tidyverse)
library(tidymodels)
library(ISLR2)

# Load the dataset and dichotomize the outcome
Sales <- ISLR2::Carseats %>%
  mutate(Sales = ifelse(Sales <= 8, "Low", "High")) 

# Check the structure of the dataset
glimpse(Sales)
Rows: 400
Columns: 11
$ Sales       <chr> "High", "High", "High", "Low", "Low", "High", "Low", "High…
$ CompPrice   <dbl> 138, 111, 113, 117, 141, 124, 115, 136, 132, 132, 121, 117…
$ Income      <dbl> 73, 48, 35, 100, 64, 113, 105, 81, 110, 113, 78, 94, 35, 2…
$ Advertising <dbl> 11, 16, 10, 4, 3, 13, 0, 15, 0, 0, 9, 4, 2, 11, 11, 5, 0, …
$ Population  <dbl> 276, 260, 269, 466, 340, 501, 45, 425, 108, 131, 150, 503,…
$ Price       <dbl> 120, 83, 80, 97, 128, 72, 108, 120, 124, 124, 100, 94, 136…
$ ShelveLoc   <fct> Bad, Good, Medium, Medium, Bad, Bad, Medium, Good, Medium,…
$ Age         <dbl> 42, 65, 59, 55, 38, 78, 71, 67, 76, 76, 26, 50, 62, 53, 52…
$ Education   <dbl> 17, 10, 12, 14, 13, 16, 15, 10, 10, 17, 10, 13, 18, 18, 18…
$ Urban       <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, No, No, No, Yes, Ye…
$ US          <fct> Yes, Yes, Yes, Yes, No, Yes, No, Yes, No, Yes, Yes, Yes, N…
Sales %>% tbl_summary()
Characteristic N = 400¹
Sales
    High 164 (41%)
    Low 236 (59%)
CompPrice 125 (115, 135)
Income 69 (43, 91)
Advertising 5.0 (0.0, 12.0)
Population 272 (139, 399)
Price 117 (100, 131)
ShelveLoc
    Bad 96 (24%)
    Good 85 (21%)
    Medium 219 (55%)
Age 55 (40, 66)
Education
    10 48 (12%)
    11 48 (12%)
    12 49 (12%)
    13 43 (11%)
    14 40 (10%)
    15 36 (9.0%)
    16 47 (12%)
    17 49 (12%)
    18 40 (10%)
Urban 282 (71%)
US 258 (65%)
¹ n (%); Median (IQR)
# For reproducibility
set.seed(212)

data_split <- initial_split(
  Sales, 
  prop = 0.5,
  strata = Sales
  )
data_split
<Training/Testing/Total>
<200/200/400>
Sales_other <- training(data_split)
dim(Sales_other)
[1] 200  11
Sales_test <- testing(data_split)
dim(Sales_test)
[1] 200  11
# Define a recipe for preprocessing
sales_recipe <- recipe(Sales ~ ., data = Sales_other) %>%
  step_naomit(all_predictors()) %>%
  # create traditional dummy variables (not necessary for random forest in R)
  step_dummy(all_nominal_predictors()) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>% 
  # center and scale numeric data (not necessary for random forest)
  step_normalize(all_numeric_predictors())

# Print the recipe
sales_recipe
# Define model specifications
classtree_mod <- decision_tree(
  cost_complexity = tune(), 
  tree_depth = tune(),
  min_n = 5,
  mode = "classification",
  engine = "rpart"
  )
classtree_mod
Decision Tree Model Specification (classification)

Main Arguments:
  cost_complexity = tune()
  tree_depth = tune()
  min_n = 5

Computational engine: rpart 
classrf_mod <- 
  rand_forest(
    mode = "classification",
    # Number of predictors randomly sampled in each split
    mtry = tune(),
    # Number of trees in ensemble
    trees = tune(),
    engine = "ranger"
  )
classrf_mod
Random Forest Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = tune()

Computational engine: ranger 
classboost_mod <- boost_tree(
  mode = "classification",
  trees = 1000,
  tree_depth = tune(),
  learn_rate = tune(),
  engine = "xgboost"
  )
classboost_mod
Boosted Tree Model Specification (classification)

Main Arguments:
  trees = 1000
  tree_depth = tune()
  learn_rate = tune()

Computational engine: xgboost 
classtree_wf <- workflow() %>%
  add_recipe(sales_recipe) %>%
  add_model(classtree_mod) 
classtree_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_naomit()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Main Arguments:
  cost_complexity = tune()
  tree_depth = tune()
  min_n = 5

Computational engine: rpart 
classrf_wf <- workflow() %>%
  add_recipe(sales_recipe) %>%
  add_model(classrf_mod)
classrf_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_naomit()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = tune()

Computational engine: ranger 
classboost_wf <- workflow() %>%
  add_recipe(sales_recipe) %>%
  add_model(classboost_mod)
classboost_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_naomit()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)

Main Arguments:
  trees = 1000
  tree_depth = tune()
  learn_rate = tune()

Computational engine: xgboost 
# Define tuning grids
library(dials)  

classtree_grid <- grid_regular(cost_complexity(), 
                               tree_depth(), 
                               levels = c(100, 5))
classtree_grid
# A tibble: 500 × 2
   cost_complexity tree_depth
             <dbl>      <int>
 1        1   e-10          1
 2        1.23e-10          1
 3        1.52e-10          1
 4        1.87e-10          1
 5        2.31e-10          1
 6        2.85e-10          1
 7        3.51e-10          1
 8        4.33e-10          1
 9        5.34e-10          1
10        6.58e-10          1
# ℹ 490 more rows
classrf_grid <- grid_regular(
  trees(range = c(100L, 500L)), 
  mtry(range = c(1L, 5L)),
  levels = c(5, 5)
  )
classrf_grid
# A tibble: 25 × 2
   trees  mtry
   <int> <int>
 1   100     1
 2   200     1
 3   300     1
 4   400     1
 5   500     1
 6   100     2
 7   200     2
 8   300     2
 9   400     2
10   500     2
# ℹ 15 more rows
classboost_grid <- grid_regular(
  tree_depth(range = c(1L, 3L)),
  learn_rate(range = c(-5, 2), trans = log10_trans()),
  levels = c(3, 10)
  )
classboost_grid
# A tibble: 30 × 2
   tree_depth learn_rate
        <int>      <dbl>
 1          1  0.00001  
 2          2  0.00001  
 3          3  0.00001  
 4          1  0.0000599
 5          2  0.0000599
 6          3  0.0000599
 7          1  0.000359 
 8          2  0.000359 
 9          3  0.000359 
10          1  0.00215  
# ℹ 20 more rows
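As a sanity check on the grid sizes above: `grid_regular()` takes the Cartesian product of the per-parameter levels, so the random forest grid has 5 × 5 = 25 rows and the boosting grid 3 × 10 = 30. A minimal self-contained sketch (random forest grid only):

```r
library(dials)

# grid_regular() crosses every level of each parameter, so
# 5 levels of trees x 5 levels of mtry gives 25 candidate models.
g <- grid_regular(
  trees(range = c(100L, 500L)),
  mtry(range = c(1L, 5L)),
  levels = c(5, 5)
)
nrow(g)                 # 25
sort(unique(g$trees))   # 100 200 300 400 500
```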
# Cross-validation
# Set seed for reproducibility
set.seed(212)

# Define resampling
folds <- vfold_cv(Sales_other, v = 5)
folds
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits           id   
  <list>           <chr>
1 <split [160/40]> Fold1
2 <split [160/40]> Fold2
3 <split [160/40]> Fold3
4 <split [160/40]> Fold4
5 <split [160/40]> Fold5
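The `[160/40]` split sizes follow directly from 5-fold CV on the 200-row training set: each fold holds out 200 / 5 = 40 rows for assessment and keeps the remaining 160 for analysis. A small self-contained sketch, using a dummy 200-row tibble in place of `Sales_other`:

```r
library(rsample)
library(tibble)

# Dummy stand-in for the 200-row training set Sales_other
dummy <- tibble(x = rnorm(200))

set.seed(212)
folds <- vfold_cv(dummy, v = 5)

# Each analysis set has 160 rows; each assessment set has 40
vapply(folds$splits, function(s) nrow(analysis(s)), integer(1))    # 160 160 160 160 160
vapply(folds$splits, function(s) nrow(assessment(s)), integer(1))  # 40 40 40 40 40
```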
# Tune models
classtree_fit <- classtree_wf %>%
  tune_grid(
    resamples = folds,
    grid = classtree_grid,
    metrics = metric_set(accuracy, roc_auc)
    )
classtree_fit
# Tuning results
# 5-fold cross-validation 
# A tibble: 5 × 4
  splits           id    .metrics             .notes          
  <list>           <chr> <list>               <list>          
1 <split [160/40]> Fold1 <tibble [1,000 × 6]> <tibble [0 × 3]>
2 <split [160/40]> Fold2 <tibble [1,000 × 6]> <tibble [0 × 3]>
3 <split [160/40]> Fold3 <tibble [1,000 × 6]> <tibble [0 × 3]>
4 <split [160/40]> Fold4 <tibble [1,000 × 6]> <tibble [0 × 3]>
5 <split [160/40]> Fold5 <tibble [1,000 × 6]> <tibble [0 × 3]>
classrf_fit <- classrf_wf %>%
  tune_grid(
    resamples = folds,
    grid = classrf_grid,
    metrics = metric_set(accuracy, roc_auc),
  )
classrf_fit 
# Tuning results
# 5-fold cross-validation 
# A tibble: 5 × 4
  splits           id    .metrics          .notes          
  <list>           <chr> <list>            <list>          
1 <split [160/40]> Fold1 <tibble [50 × 6]> <tibble [0 × 3]>
2 <split [160/40]> Fold2 <tibble [50 × 6]> <tibble [0 × 3]>
3 <split [160/40]> Fold3 <tibble [50 × 6]> <tibble [0 × 3]>
4 <split [160/40]> Fold4 <tibble [50 × 6]> <tibble [0 × 3]>
5 <split [160/40]> Fold5 <tibble [50 × 6]> <tibble [0 × 3]>
classboost_fit <- classboost_wf %>%
  tune_grid(
    resamples = folds,
    grid = classboost_grid,
    metrics = metric_set(accuracy, roc_auc)
    )
classboost_fit
# Tuning results
# 5-fold cross-validation 
# A tibble: 5 × 4
  splits           id    .metrics          .notes          
  <list>           <chr> <list>            <list>          
1 <split [160/40]> Fold1 <tibble [60 × 6]> <tibble [0 × 3]>
2 <split [160/40]> Fold2 <tibble [60 × 6]> <tibble [0 × 3]>
3 <split [160/40]> Fold3 <tibble [60 × 6]> <tibble [0 × 3]>
4 <split [160/40]> Fold4 <tibble [60 × 6]> <tibble [0 × 3]>
5 <split [160/40]> Fold5 <tibble [60 × 6]> <tibble [0 × 3]>

Visualize CV results:

classtree_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "roc_auc") %>%
  mutate(tree_depth = as.factor(tree_depth)) %>%
  ggplot(mapping = aes(x = cost_complexity, y = mean, color = tree_depth)) +
  geom_point() + 
  geom_line() + 
  labs(x = "cost_complexity", y = "CV ROC AUC", color = "tree_depth") 
# A tibble: 1,000 × 8
   cost_complexity tree_depth .metric  .estimator  mean     n std_err
             <dbl>      <int> <chr>    <chr>      <dbl> <int>   <dbl>
 1        1   e-10          1 accuracy binary     0.685     5  0.0232
 2        1   e-10          1 roc_auc  binary     0.629     5  0.0298
 3        1.23e-10          1 accuracy binary     0.685     5  0.0232
 4        1.23e-10          1 roc_auc  binary     0.629     5  0.0298
 5        1.52e-10          1 accuracy binary     0.685     5  0.0232
 6        1.52e-10          1 roc_auc  binary     0.629     5  0.0298
 7        1.87e-10          1 accuracy binary     0.685     5  0.0232
 8        1.87e-10          1 roc_auc  binary     0.629     5  0.0298
 9        2.31e-10          1 accuracy binary     0.685     5  0.0232
10        2.31e-10          1 roc_auc  binary     0.629     5  0.0298
   .config               
   <chr>                 
 1 Preprocessor1_Model001
 2 Preprocessor1_Model001
 3 Preprocessor1_Model002
 4 Preprocessor1_Model002
 5 Preprocessor1_Model003
 6 Preprocessor1_Model003
 7 Preprocessor1_Model004
 8 Preprocessor1_Model004
 9 Preprocessor1_Model005
10 Preprocessor1_Model005
# ℹ 990 more rows

classrf_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "roc_auc") %>%
  ggplot(mapping = aes(x = trees, y = mean, color = factor(mtry))) +
  geom_point() + 
  # geom_line() + 
  labs(x = "Num. of Trees", y = "CV AUC")
# A tibble: 50 × 8
    mtry trees .metric  .estimator  mean     n std_err .config              
   <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
 1     1   100 accuracy binary     0.725     5  0.0158 Preprocessor1_Model01
 2     1   100 roc_auc  binary     0.817     5  0.0318 Preprocessor1_Model01
 3     1   200 accuracy binary     0.77      5  0.0255 Preprocessor1_Model02
 4     1   200 roc_auc  binary     0.829     5  0.0321 Preprocessor1_Model02
 5     1   300 accuracy binary     0.765     5  0.0232 Preprocessor1_Model03
 6     1   300 roc_auc  binary     0.823     5  0.0294 Preprocessor1_Model03
 7     1   400 accuracy binary     0.78      5  0.02   Preprocessor1_Model04
 8     1   400 roc_auc  binary     0.831     5  0.0363 Preprocessor1_Model04
 9     1   500 accuracy binary     0.755     5  0.0278 Preprocessor1_Model05
10     1   500 roc_auc  binary     0.831     5  0.0345 Preprocessor1_Model05
# ℹ 40 more rows

classboost_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "roc_auc") %>%
  ggplot(mapping = aes(x = learn_rate, y = mean, color = factor(tree_depth))) +
  geom_point() +
  labs(x = "Learning Rate", y = "CV AUC") +
  scale_x_log10()
# A tibble: 60 × 8
   tree_depth learn_rate .metric  .estimator  mean     n std_err
        <int>      <dbl> <chr>    <chr>      <dbl> <int>   <dbl>
 1          1  0.00001   accuracy binary     0.675     5  0.0306
 2          1  0.00001   roc_auc  binary     0.638     5  0.0219
 3          2  0.00001   accuracy binary     0.68      5  0.0289
 4          2  0.00001   roc_auc  binary     0.692     5  0.0359
 5          3  0.00001   accuracy binary     0.73      5  0.0320
 6          3  0.00001   roc_auc  binary     0.731     5  0.0370
 7          1  0.0000599 accuracy binary     0.685     5  0.0232
 8          1  0.0000599 roc_auc  binary     0.650     5  0.0235
 9          2  0.0000599 accuracy binary     0.715     5  0.0257
10          2  0.0000599 roc_auc  binary     0.705     5  0.0308
   .config              
   <chr>                
 1 Preprocessor1_Model01
 2 Preprocessor1_Model01
 3 Preprocessor1_Model02
 4 Preprocessor1_Model02
 5 Preprocessor1_Model03
 6 Preprocessor1_Model03
 7 Preprocessor1_Model04
 8 Preprocessor1_Model04
 9 Preprocessor1_Model05
10 Preprocessor1_Model05
# ℹ 50 more rows

classbest_tree <- classtree_fit %>%
  select_best(metric = "roc_auc")
classbest_tree
# A tibble: 1 × 3
  cost_complexity tree_depth .config               
            <dbl>      <int> <chr>                 
1          0.0231          8 Preprocessor1_Model293
classbest_rf <- classrf_fit %>%
  select_best(metric = "roc_auc")
classbest_rf
# A tibble: 1 × 3
   mtry trees .config              
  <int> <int> <chr>                
1     5   400 Preprocessor1_Model24
classbest_boost <- classboost_fit %>%
  select_best(metric = "roc_auc")
classbest_boost
# A tibble: 1 × 3
  tree_depth learn_rate .config              
       <int>      <dbl> <chr>                
1          1     0.0774 Preprocessor1_Model16
# Final workflow
final_classtreewf <- classtree_wf %>%
  finalize_workflow(classbest_tree)
final_classtreewf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_naomit()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Main Arguments:
  cost_complexity = 0.0231012970008316
  tree_depth = 8
  min_n = 5

Computational engine: rpart 
final_classrfwf <- classrf_wf %>%
  finalize_workflow(classbest_rf)
final_classrfwf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_naomit()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 5
  trees = 400

Computational engine: ranger 
final_classboostwf <- classboost_wf %>%
  finalize_workflow(classbest_boost)
final_classboostwf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_naomit()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)

Main Arguments:
  trees = 1000
  tree_depth = 1
  learn_rate = 0.0774263682681127

Computational engine: xgboost 
# Fit the whole training set, then predict the test cases
final_classtreefit <- 
  final_classtreewf %>%
  last_fit(data_split)
final_classtreefit
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits            id               .metrics .notes   .predictions .workflow 
  <list>            <chr>            <list>   <list>   <list>       <list>    
1 <split [200/200]> train/test split <tibble> <tibble> <tibble>     <workflow>
final_classrffit <- 
  final_classrfwf %>%
  last_fit(data_split)
final_classrffit
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits            id               .metrics .notes   .predictions .workflow 
  <list>            <chr>            <list>   <list>   <list>       <list>    
1 <split [200/200]> train/test split <tibble> <tibble> <tibble>     <workflow>
final_classboostfit <- 
  final_classboostwf %>%
  last_fit(data_split)
final_classboostfit
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits            id               .metrics .notes   .predictions .workflow 
  <list>            <chr>            <list>   <list>   <list>       <list>    
1 <split [200/200]> train/test split <tibble> <tibble> <tibble>     <workflow>
# Test metrics
final_classtreefit %>% 
  collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.755 Preprocessor1_Model1
2 roc_auc  binary         0.760 Preprocessor1_Model1
final_classrffit %>%
  collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.8   Preprocessor1_Model1
2 roc_auc  binary         0.872 Preprocessor1_Model1
final_classboostfit %>%
  collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.845 Preprocessor1_Model1
2 roc_auc  binary         0.918 Preprocessor1_Model1
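Placing the three test-set results side by side makes the comparison explicit: the boosted tree does best (accuracy 0.845, AUC 0.918), followed by the random forest (0.800 / 0.872) and the pruned classification tree (0.755 / 0.760). A self-contained sketch assembling the numbers reported above (in the analysis session one would instead bind the three `collect_metrics()` tibbles):

```r
library(tibble)
library(dplyr)

# Test metrics reported above, hand-entered for a side-by-side view
test_metrics <- tribble(
  ~model,            ~accuracy, ~roc_auc,
  "decision tree",       0.755,    0.760,
  "random forest",       0.800,    0.872,
  "boosted tree",        0.845,    0.918
)
test_metrics %>% arrange(desc(roc_auc))
```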
# Visualize the final model
library(rpart.plot)
library(randomForest)
library(xgboost)
library(data.table)
final_classtree <- extract_workflow(final_classtreefit)
final_classtree
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_naomit()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
n= 200 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 200 82 Low (0.41000000 0.59000000)  
   2) ShelveLoc_Good>=0.6742344 44  9 High (0.79545455 0.20454545)  
     4) Price< 1.204802 36  3 High (0.91666667 0.08333333) *
     5) Price>=1.204802 8  2 Low (0.25000000 0.75000000) *
   3) ShelveLoc_Good< 0.6742344 156 47 Low (0.30128205 0.69871795)  
     6) Price< -0.9414209 26  8 High (0.69230769 0.30769231)  
      12) Income>=0.6401903 12  1 High (0.91666667 0.08333333) *
      13) Income< 0.6401903 14  7 High (0.50000000 0.50000000)  
        26) ShelveLoc_Medium>=-0.0499373 9  2 High (0.77777778 0.22222222) *
        27) ShelveLoc_Medium< -0.0499373 5  0 Low (0.00000000 1.00000000) *
     7) Price>=-0.9414209 130 29 Low (0.22307692 0.77692308)  
      14) Advertising>=0.78991 26 12 High (0.53846154 0.46153846)  
        28) Price< 0.8505711 20  6 High (0.70000000 0.30000000)  
          56) CompPrice>=-0.787038 16  2 High (0.87500000 0.12500000) *
          57) CompPrice< -0.787038 4  0 Low (0.00000000 1.00000000) *
        29) Price>=0.8505711 6  0 Low (0.00000000 1.00000000) *
      15) Advertising< 0.78991 104 15 Low (0.14423077 0.85576923)  
        30) CompPrice>=1.238547 15  6 Low (0.40000000 0.60000000)  
          60) Income>=0.247091 6  1 High (0.83333333 0.16666667) *
          61) Income< 0.247091 9  1 Low (0.11111111 0.88888889) *
        31) CompPrice< 1.238547 89  9 Low (0.10112360 0.89887640) *
final_classtree %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

final_classrf <- extract_workflow(final_classrffit)
final_classrf
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_naomit()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~5L,      x), num.trees = ~400L, num.threads = 1, verbose = FALSE,      seed = sample.int(10^5, 1), probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  400 
Sample size:                      200 
Number of independent variables:  11 
Mtry:                             5 
Target node size:                 10 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.1495906 
final_classboost <- extract_workflow(final_classboostfit)
final_classboost
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_naomit()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
##### xgb.Booster
raw: 744.2 Kb 
call:
  xgboost::xgb.train(params = list(eta = 0.0774263682681127, max_depth = 1L, 
    gamma = 0, colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
    subsample = 1), data = x$data, nrounds = 1000, watchlist = x$watchlist, 
    verbose = 0, nthread = 1, objective = "binary:logistic")
params (as set within xgb.train):
  eta = "0.0774263682681127", max_depth = "1", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", nthread = "1", objective = "binary:logistic", validate_parameters = "TRUE"
xgb.attributes:
  niter
callbacks:
  cb.evaluation.log()
# of features: 11 
niter: 1000
nfeatures : 11 
evaluation_log:
    iter training_logloss
       1        0.6789387
       2        0.6667352
---                      
     999        0.1727765
    1000        0.1727029
# Extract the underlying xgb.Booster and plot variable importance
boost_model <- extract_fit_engine(final_classboost)
importance <- xgb.importance(model = boost_model)
xgboost::xgb.plot.importance(importance_matrix = importance)

library(vip)

final_classtree %>% 
  extract_fit_parsnip() %>% 
  vip()

final_classboost %>%
  extract_fit_parsnip() %>%
  vip()